24 research outputs found

    Developing and applying heterogeneous phylogenetic models with XRate

    Modeling sequence evolution on phylogenetic trees is a useful technique in computational biology. Especially powerful are models which take account of the heterogeneous nature of sequence evolution according to the "grammar" of the encoded gene features. However, beyond a modest level of model complexity, manual coding of models becomes prohibitively labor-intensive. We demonstrate, via a set of case studies, the new built-in model-prototyping capabilities of XRate (macros and Scheme extensions). These features allow rapid implementation of phylogenetic models which would have previously been far more labor-intensive. XRate's new capabilities for lineage-specific models, ancestral sequence reconstruction, and improved annotation output are also discussed. XRate's flexible model-specification capabilities and computational efficiency make it well-suited to developing and prototyping phylogenetic grammar models. XRate is available as part of the DART software package (http://biowiki.org/DART). Comment: 34 pages, 3 figures, glossary of XRate model terminology.

    Accurate reconstruction of insertion-deletion histories by statistical phylogenetics

    The Multiple Sequence Alignment (MSA) is a computational abstraction that represents a partial summary either of indel history, or of structural similarity. Taking the former view (indel history), it is possible to use formal automata theory to generalize the phylogenetic likelihood framework for finite substitution models (Dayhoff's probability matrices and Felsenstein's pruning algorithm) to arbitrary-length sequences. In this paper, we report results of a simulation-based benchmark of several methods for reconstruction of indel history. The methods tested include a relatively new algorithm for statistical marginalization of MSAs that sums over a stochastically-sampled ensemble of the most probable evolutionary histories. For mammalian evolutionary parameters on several different trees, the single most likely history sampled by our algorithm appears less biased than histories reconstructed by other MSA methods. The algorithm can also be used for alignment-free inference, where the MSA is explicitly summed out of the analysis. As an illustration of our method, we discuss reconstruction of the evolutionary histories of human protein-coding genes. Comment: 28 pages, 15 figures. arXiv admin note: text overlap with arXiv:1103.434
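
For readers unfamiliar with the pruning algorithm named in the abstract above, the following is a minimal sketch of it for a single alignment column. It is not the paper's method (which generalizes the framework to arbitrary-length sequences). The Jukes-Cantor substitution model, the nested-list tree encoding, and all function names are assumptions chosen for illustration.

```python
import math

ALPHABET = "ACGT"


def jc69_prob(a, b, t):
    """Jukes-Cantor probability of observing b after time t, starting from a."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def partial_likelihoods(node):
    """Return P(observed leaves below node | node state) for each state.

    A leaf is a one-letter string; an internal node is a list of
    (child, branch_length) pairs. This encoding is hypothetical.
    """
    if isinstance(node, str):
        return {x: 1.0 if x == node else 0.0 for x in ALPHABET}
    result = {x: 1.0 for x in ALPHABET}
    for child, branch_length in node:
        below = partial_likelihoods(child)
        for x in ALPHABET:
            # Sum over the child's state, weighted by the substitution model.
            result[x] *= sum(
                jc69_prob(x, y, branch_length) * below[y] for y in ALPHABET
            )
    return result


def column_likelihood(tree):
    """Likelihood of one alignment column under a uniform root prior."""
    at_root = partial_likelihoods(tree)
    return sum(0.25 * at_root[x] for x in ALPHABET)


# Example: two leaves joined at an internal node, which joins a third leaf.
example_tree = [([("A", 0.1), ("A", 0.1)], 0.05), ("C", 0.2)]
example_likelihood = column_likelihood(example_tree)
```

Each internal node is visited once, so the cost is linear in the number of branches rather than exponential in the number of unobserved ancestral states.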

    Accurate Detection of Recombinant Breakpoints in Whole-Genome Alignments

    We propose a novel method for detecting sites of molecular recombination in multiple alignments. Our approach is a compromise between previous extremes of computationally prohibitive but mathematically rigorous methods and imprecise heuristic methods. Using a combined algorithm for estimating tree structure and hidden Markov model parameters, our program detects changes in phylogenetic tree topology over a multiple sequence alignment. We evaluate our method on benchmark datasets from previous studies on two recombinant pathogens, Neisseria and HIV-1, as well as simulated data. We show that we are able not only to detect recombinant regions of vastly different sizes but also to locate breakpoints with great accuracy. We show that our method performs well at inferring recombination breakpoints while remaining practical for larger datasets. In all cases, we confirm the breakpoint predictions of previous studies, and in many cases we offer novel predictions.

    Statistical phylogenetic methods with applications to virus evolution

    This thesis explores methods for computational comparative modeling of genetic sequences. The framework within which this modeling is undertaken is that of sequence alignments and associated phylogenetic trees. The first part explores methods for building ancestral sequence alignments making explicit use of phylogenetic likelihood functions. New capabilities of an existing MCMC alignment sampler are discussed in detail, and the sampler is used to analyze a set of HIV/SIV gp120 proteins. An approximate maximum-likelihood alignment method is presented, first in a tutorial-style format and later in precise mathematical terms. An implementation of this method is evaluated alongside leading alignment programs. The second part describes methods utilizing multiple sequence alignments. First, mutation rate is used to predict positional mutational sensitivities for a protein. Second, the flexible, automated model-specification capabilities of the XRate software are presented. The final chapter presents recHMM, a method to detect recombination among sequences by use of a phylogenetic hidden Markov model with a tree in each hidden state.

    The model used by PhastCons, a 3-nonterminal HMM with rate multipliers, is compactly expressed by XRate's macro language.

    Different nonterminals have different evolutionary rates, but they all share the same underlying substitution model. Transition probabilities are shared: a transition between nonterminals happens with probability leaveProb, and self-transitions happen with probability stayProb. This model (with any number of nonterminals) can be expressed in XRate's macro language in approximately 20 lines of code.
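
The shared-transition structure described in this caption can be sketched in Python as follows. This is not XRate macro code. It assumes that every transition to a different nonterminal has the same probability leaveProb, that each row sums to one (so leaveProb follows from stayProb and the number of nonterminals), and that end transitions are ignored. The function names and the evenly spaced rate multipliers are hypothetical.

```python
def shared_transition_matrix(n_states, stay_prob):
    """Build an n x n transition matrix with one shared stay/leave probability.

    Assumption: each off-diagonal entry equals
    leave_prob = (1 - stay_prob) / (n_states - 1), so every row sums to 1.
    """
    if n_states < 2:
        raise ValueError("need at least two nonterminals")
    if not 0.0 <= stay_prob <= 1.0:
        raise ValueError("stay_prob must be a probability")
    leave_prob = (1.0 - stay_prob) / (n_states - 1)
    return [
        [stay_prob if i == j else leave_prob for j in range(n_states)]
        for i in range(n_states)
    ]


def rate_multipliers(n_states):
    """Hypothetical evenly spaced rate multipliers, slowest state first."""
    return [(k + 1) / n_states for k in range(n_states)]


# Three nonterminals, as in the caption's 3-nonterminal HMM.
matrix = shared_transition_matrix(3, stay_prob=0.9)
rates = rate_multipliers(3)
```

Because only the state count and one probability are needed, the same construction extends to any number of nonterminals, which is the property the caption attributes to the macro-based description.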

    A schematic of a DLESS-style phylo-HMM: each node of the tree has its own nonterminal, such that the node-rooted subtree evolves at a slower rate than the rest of the tree.

    Inferring the pattern of hidden nonterminals generating an alignment allows for detecting regions of lineage-specific selection. Expressing this model compactly in XRate's macro language allows it to be used with any input tree without having to write data-specific code or use external model-generating scripts.
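
The idea of deriving one hidden state per tree node from an arbitrary input tree can be sketched in Python as follows. This is an illustration only, not XRate's macro language or the DLESS implementation. The tree encoding (a dict from parent to children), the choice to include the branch leading into the selected node in its slowed subtree, and the value of slow_rate are all assumptions.

```python
def descendants(children, node):
    """Return the node itself plus every node beneath it."""
    found = [node]
    for child in children.get(node, []):
        found.extend(descendants(children, child))
    return found


def lineage_rate_table(children, root, slow_rate=0.3):
    """Map each node (one hidden state per node) to per-branch rate multipliers.

    A branch is named after the node at its lower end. In the state for a
    given node, branches inside that node's subtree (assumed here to include
    the branch leading into the node) get slow_rate; all others get 1.0.
    """
    all_nodes = descendants(children, root)
    branches = [n for n in all_nodes if n != root]
    table = {}
    for state_node in all_nodes:
        slowed = set(descendants(children, state_node))
        table[state_node] = {
            branch: slow_rate if branch in slowed else 1.0
            for branch in branches
        }
    return table


# Hypothetical three-leaf tree: ((A, B) anc, C) root.
tree = {"root": ["anc", "C"], "anc": ["A", "B"]}
table = lineage_rate_table(tree, "root")
```

Since the table is computed from the tree alone, swapping in a different tree needs no model-specific code, which mirrors the benefit the caption describes.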

    Data from several XRate analyses, shown alongside genes (A) and known RNA structures (B) in poliovirus.

    XDecoder (C) recovers all known structures with high posterior probability and predicts a promising target for experimental probing (region 6800–7100). XDecoder was run on an alignment of 27 poliovirus sequences with the results visualized as a track in JBrowse [32] via a wiggle file. Alongside XDecoder probabilities are the three signals which XDecoder aims to disentangle: (D) conservation, (E) coding potential, and (F) RNA structure. Paradoxically, the CRE and RNase-L inhibition elements show both conservation and coding sequence preservation, whereas PFOLD's predictions show only a slight increase in probability density around the known structures. XDecoder is the only grammar which returns predictions of reasonable specificity. The full JBrowse instance is included as Text S2.
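
As a rough illustration of the wiggle step mentioned in this caption, per-position probabilities can be written in the fixedStep wiggle layout with a few lines of Python. This is a sketch, not the script used for the figure. The function name, the sequence name, the track name, and the example values are hypothetical and are not taken from the analysis.

```python
def to_fixed_step_wiggle(chrom, values, start=1, step=1, track_name="posterior"):
    """Format per-position values as fixedStep wiggle text.

    Wiggle coordinates are 1-based, so start=1 refers to the first position.
    """
    header = [
        f'track type=wiggle_0 name="{track_name}"',
        f"fixedStep chrom={chrom} start={start} step={step}",
    ]
    body = [f"{value:.4f}" for value in values]
    return "\n".join(header + body) + "\n"


# Made-up probabilities for three consecutive alignment columns.
wiggle_text = to_fixed_step_wiggle("poliovirus", [0.10, 0.95, 0.87])
```

The resulting text can be saved to a file and loaded by a genome browser that accepts wiggle tracks.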